Group 10: Hamza Siddiqui, Hridyansh Gupta, Maher Thakkar, Manas M Bhat and Parika Rawat

Course: BUDT704

Section: 0502

Date: 5th December 2022

Traffickers - Traffic Risk Analysis in Maryland¶

Traffic violations are acts of breaching traffic regulations that are intended to safeguard the lives of everyone in the vicinity of the vehicles. We are the Traffic Risk Analysis Team located in College Park, Maryland. We aim to analyze traffic incidents and violations in the United States to keep the roads safer. The way we keep them safer is to show our analysis to police departments and local government authorities in order for them to see where and why traffic violations/incidents occur. In this report, we are analyzing traffic violations that occur in Maryland with the majority taking place in Montgomery County. We acquired the dataset from Kaggle. This dataset shows traffic violations recorded from 2012 to 2018 with 1,811,977 violations recorded.

In order to structure our analysis to best help the local Police Department, we came up with 5 questions to be answered through our analysis.

Which areas in Maryland are most prone to accidents?¶

Using the longitude and latitude columns provided in the dataset, we are pinpointing where every violation involving an accident occurs in order to find where the most accidents occur. This is important for the police department to know because if most accidents occur within the Police Department’s jurisdiction, they can take notice of potentially high-risk traffic areas and deploy appropriate signs and warnings to drivers. They can also increase the number of speed cameras or local patrol in those areas to decrease accidents.

Are there any signs of racial bias in the police department traffic control?¶

Using the type of violation reporting system, technology with cameras, or actually human police officers, we can see whether or not there is any racial bias in who gets pulled over or indicated for a traffic violation.

We hypothesize that the technology reporting systems will indicate a near 1:1 ratio of violations for white drivers and non-white drivers. If this is true, then we can see whether or not the human reporting system shows different results to indicate any racial bias against non-white drivers. This is possible because in Maryland state, 57%[1] of residents are White which means that the number of white people in Maryland to non-white people is comparable.

This is important because racial bias being in the police department is detrimental to the community. Not only that, but from the police department's point of view, it would be terrible for their reputation in this day and age with all the racial tension in the United States for their police officers to be pulling over minorities at a disproportionate rate compared to white residents. In order to keep the trust of the community and keep roads safer, the police need to know of any potential indications of racial bias in their department.

Does a reckless driver get a warning or a citation?¶

  • This analysis uses our own discretion for what a reckless driver is:
  • The driver was involved in an accident.
  • The driver contributed to an accident.
  • The driver damaged property with their vehicle.
  • The driver was under the influence of alcohol.

This analysis can show us how effective the police department is enforcing traffic laws. A citation is more harmful to a driver than a warning but if the incident is harmful or dangerous enough, a citation is necessary in order to keep the roads safer. If citations are not given enough then the police department should acknowledge it in their training. We will also test for males vs. females on whether drivers get a citation or warning to see whether there is any bias there for the police department to clean up.

Which Gender causes the most accidents?¶

Using the gender and race column in the dataset, we can analyze which gender/race causes the most accidents. This, similarly to the area analysis, will add more information on how traffic accidents occur. We can see clearly whether one gender or race has more violations involving accidents than another. This is important to the police department as they can see which demographic cause more accidents, and further analysis can see where each demographic drives most often in order to take the same precautions as in the previous analysis with more patrol officers and cameras.

Predicting for a given set of vehicles will it cause an accident or not (using ML).¶

This analysis will use machine learning in order to predict whether a certain type of vehicle or vehicle will cause an accident or not. Using the dataset and its trends, we can predict this in order to show the police department the likelihood a vehicle causes an accident or not. This is important because if the police are flushed with calls as they can be and a violation is recorded on a speed camera or using technology, we can determine the chance that there was an accident to be reported with the violation. We can also use this machine learning model to determine whether a certain vehicle can be labeled a high-risk vehicle or not which will make the police department more aware when doing routine stops for violations. With high risk vehicles, we can give warnings to drivers in advance of accidents to keep roads safer.

DATA ANALYSIS¶

We have chosen the Data Analysis part because we were able to go above and beyond in the techniques and visualizations. We created a unique story through our analyses and how our projects analysis can help the police to do a better job in keeping the people of the community safe and secure and which more rules to apply. Our analysis is not just basic as we used different techniques to gather our answers including the use of follium, interactive plots, prediciting ML algorithoms with an accuracy of 97.4%.

For our analysis, We have identified a Traffic Violation dataset from Kaggle(https://www.kaggle.com/datasets/rounak041993/traffic-violations-in-maryland-county), from which we will be cleaning, analyzing and deriving insights in order to come up with useful conclusions

In [1]:
#Import the python libraries needed to run the code
import pandas as pd
import numpy as np
from numpy import nan as NA
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import plotly.graph_objects as go
import folium
from folium import plugins
from folium.plugins import HeatMap
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
%matplotlib inline
#Set options to avoid displaying warnings while slicing dataframe
pd.options.mode.chained_assignment = None
pd.set_option('display.max_rows', 10)
In [2]:
#Import the data and display the first few rows of observation
traffic_violation_df = pd.read_csv('Traffic_Violation.csv')
traffic_violation_df.head()
C:\Users\Admin\AppData\Local\Temp\ipykernel_21932\403433003.py:2: DtypeWarning: Columns (19,20,21,22,23,24,25,34) have mixed types. Specify dtype option on import or set low_memory=False.
  traffic_violation_df = pd.read_csv('Traffic_Violation.csv')
Out[2]:
SeqID Date Of Stop Time Of Stop Agency SubAgency Description Location Latitude Longitude Accident ... Charge Article Contributed To Accident Race Gender Driver City Driver State DL State Arrest Type Geolocation
0 fbc324ab-bc8d-4743-ba23-7f9f370005e1 08/11/2019 20:02:00 MCP 2nd District, Bethesda LEAVING UNATTENDED VEH. W/O STOPPING ENGINE, L... CORDELL ST @ NORFOLK AVE. 38.989743 -77.097770 No ... 21-1101(a) Transportation Article False BLACK M SILVER SPRING MD MD A - Marked Patrol (38.9897433333333, -77.09777)
1 a6d904ec-d666-4bc3-8984-f37a4b31854d 08/12/2019 13:41:00 MCP 2nd District, Bethesda EXCEEDING POSTED MAXIMUM SPEED LIMIT: 85 MPH I... NBI270 AT MIDDLEBROOK RD 39.174110 -77.246170 No ... 21-801.1 Transportation Article False WHITE M SILVER SPRING MD MD A - Marked Patrol (39.17411, -77.24617)
2 54a64f6a-df28-4b65-a335-08883866aa46 08/12/2019 21:00:00 MCP 5th District, Germantown DRIVING VEH W/ TV-TYPE RECEIVING VIDEO EQUIP T... MIDDLEBROOK AN 355 39.182015 -77.238221 No ... 21-1129 Transportation Article False BLACK M GAITHERSBURG MD MD A - Marked Patrol (39.1820155, -77.2382213333333)
3 cf5479b6-9bc7-4216-a7b2-99e57ae932af 08/12/2019 21:43:00 MCP 5th District, Germantown DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI... GERMANTOWN RD AND ALE HOUSE 39.160508 -77.284023 No ... 13-401(h) Transportation Article False BLACK M GERMANTOWN MD MD A - Marked Patrol (39.1605076666667, -77.284023)
4 5601ca35-8ee7-4f8e-9208-d89cde96d469 08/12/2019 21:30:00 MCP 2nd District, Bethesda FAILURE OF LICENSEE TO NOTIFY ADMINISTRATION O... EASTWEST/ 355 38.984247 -77.090548 No ... 16-116(a) Transportation Article False BLACK M SILVER SPRING MD MD A - Marked Patrol (38.9842466666667, -77.0905483333333)

5 rows × 43 columns

From the above result, we see that our data set has 43 attributes. However, for our analysis, we do not need all of them and hence, some can be excluded. Also, to make it easier to understand the data set, let's reorder the columns in accordance to the analysis that we will be performing on this data.

In [3]:
#Retain only the columns needed for our analysis
traffic_violation_df = traffic_violation_df[['SeqID', 'Make', 'Model', 'VehicleType', 'Race', 'Accident','Fatal','Description', 'Gender', 'Date Of Stop','State', 'Year', 'Time Of Stop','Violation Type', 'DL State', 'Driver State', 'Personal Injury', 'Property Damage', 'Alcohol', 'Latitude', 'Longitude', 'Arrest Type']]

Each observation in our dataset contains an unique ID, which is the seqID. This column can be set as the index of our dataframe and can be used to identify individual rows.

In [4]:
#Set an index for our data
traffic_violation_df.set_index('SeqID').head(2)
Out[4]:
Make Model VehicleType Race Accident Fatal Description Gender Date Of Stop State ... Time Of Stop Violation Type DL State Driver State Personal Injury Property Damage Alcohol Latitude Longitude Arrest Type
SeqID
fbc324ab-bc8d-4743-ba23-7f9f370005e1 TOYOTA CAMRY 02 - Automobile BLACK No No LEAVING UNATTENDED VEH. W/O STOPPING ENGINE, L... M 08/11/2019 MD ... 20:02:00 Citation MD MD No No No 38.989743 -77.09777 A - Marked Patrol
a6d904ec-d666-4bc3-8984-f37a4b31854d HONDA CIVIC 02 - Automobile WHITE No No EXCEEDING POSTED MAXIMUM SPEED LIMIT: 85 MPH I... M 08/12/2019 MD ... 13:41:00 Citation MD MD No No No 39.174110 -77.24617 A - Marked Patrol

2 rows × 21 columns

Now, let's check how many observations we have for the 21 attributes that we have included in our dataset

In [5]:
#Determine the number of rows in our dataset
print('We have ' + str(len(traffic_violation_df))+' rows of data')
We have 1811977 rows of data

Now that we have the data set we require, let's proceed further and check if all the attributes hold valid observations or not.

In [6]:
#Calculate the number of rows for each variable which have a non-null value
traffic_violation_df.notnull().sum()
Out[6]:
SeqID              1811977
Make               1811910
Model              1811765
VehicleType        1811977
Race               1811977
                    ...   
Property Damage    1811977
Alcohol            1811977
Latitude           1811977
Longitude          1811977
Arrest Type        1811977
Length: 22, dtype: int64

From the above result, we see that the difference in the number of valid observations in few attributes. This could be due to the invalid or corrupted observations, duplicated data or mislabeled columns and many more. However, all of this, if not corrected, can be misleading towards an imprecise data analysis. Hence, it is crucial that we establish the correct data cleaning process which will increase the quality of our data.

Data Cleaning¶

Data Cleaning can be done in different techniques based on the problem that we face. To identify the same, we need to explore each column individually.

Step 1: Let's explore the manufacturer of the vechicles involved in the violations

In [7]:
#Count the number of times each Vehicle make is involved in a violation
traffic_violation_df['Make'].value_counts()
Out[7]:
TOYOTA      211188
HONDA       199886
FORD        166649
TOYT         99202
NISSAN       98115
             ...  
NEW              1
SABUA            1
EXPRESS          1
HYUNDAQI         1
JYUDAI           1
Name: Make, Length: 4457, dtype: int64

From the generated output, we can see that there are several misspells in our data set such as 'TOYOTA' having been entered as 'TOYT' and 'HYUNDAI' as 'HYUNDAQI'. Although, it is not possible to go through all the different brands(or misspells), we can generate a list of the top 60 brands based on the violation and then identify the names that have been misspelt

In [8]:
#Display a list of top 60 vehicle makes involved in a violation
traffic_violation_df['Make'].value_counts().head(60)
Out[8]:
TOYOTA      211188
HONDA       199886
FORD        166649
TOYT         99202
NISSAN       98115
             ...  
MINI          3581
LINC          3443
SUZUKI        3107
BUIC          3030
INFINITY      2954
Name: Make, Length: 60, dtype: int64

Now that we know how many different brand names have been mis-spelt, we can go ahead and replace them with the actual brand name

In [9]:
#Replace the top few mis-spelt brand names
traffic_violation_df['Make'].replace(to_replace=['TOYT','TYOTA','T0YOTA','TOY0TA','T0Y0TA','TOYOT','TOYOY','TOYATA','TOYTA','TOYOA','TOYA','TPYOTA','TOYO','TOYOTAA','TOY'], value="TOYOTA", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['HOND','HINDA','HODNA', 'HYUNDA'], value="HONDA", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['NISS', 'NISSIAN'], value="NISSAN", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['CHEV', 'CHEVY', 'CHEVORLET'], value="CHEVROLET", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['HYUN','HYUNDAQI', 'HYUND'], value="HYUNDAI", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['MERC', 'MERZ', 'MERCEDES BENZ', 'MERCEDEZ', 'MER'], value="MERCEDES", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['VW', 'VOLKS', 'VOLK', 'VOLKSWAGON', 'VOLKSWAGAN'], value="VOLKSWAGEN", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['MAZD'], value="MAZDA", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['VOLV'], value="VOLVO", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['LEXS', 'LEXU', 'LEX'], value="LEXUS", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['SUBA'], value="SUBARU", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['CADI'], value="CADILLAC", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['MITS'], value="MITSUBISHI", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['INFI', 'INFINITY'], value="INFINITI", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['CHRY', 'CHRYS'], value="CHRYSLER", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['DODG'], value="DODGE", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['ACUR'], value="ACURA", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['PONT'], value="PONTIAC", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['LINC'], value="LINCOLN", inplace = True)
traffic_violation_df['Make'].replace(to_replace=['BUIC', 'BUIK'], value="BUICK", inplace = True)
traffic_violation_df['Make'].value_counts().head(10)
Out[9]:
TOYOTA       318281
HONDA        267531
FORD         166649
NISSAN       138556
CHEVROLET    133656
HYUNDAI       62838
DODGE         59431
ACURA         55292
MERCEDES      53232
BMW           50442
Name: Make, dtype: int64

We can see a marginal increase in the number of brands being involved in traffic violations post cleaning this field, which would have otherwise been categorized as a different brand altogether.

Step 2: Let's explore the race of the driver mentioned in the dataset

We can begin to do so by checking the count of each race in the data.

In [10]:
#Calculate the count of each Race in our dataset
traffic_violation_df['Race'].value_counts()
Out[10]:
WHITE              622462
BLACK              576774
HISPANIC           398309
OTHER              107749
ASIAN              103366
NATIVE AMERICAN      3317
Name: Race, dtype: int64

There is a racial group known as 'Other.' All the other racial groupings are accurately displayed and identified. This column does not need to be cleaned because there are no NaN values and all the rows in this column are properly identifiable.

Step 3: Let's explore the genders of the drivers involved in the violations

In [11]:
#Calculate the count of each Gender in our dataset
traffic_violation_df['Gender'].value_counts()
Out[11]:
M    1218321
F     590975
U       2681
Name: Gender, dtype: int64

Here, M signifies Males, F signifies Females and U as Unique. Since, all the gender codes are valid, there is no need for the column to be cleaned

Step 4: Let's explore the DL state which holds the data of which state traffic violator's driving license were issued

Before we proceed further into this column, we may change the number of rows displayed by python in order to view multiple rows at the same time.

In [12]:
#Expanding the output display to see the entire data of columns
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

Now, let's find the number of unique states and the number of observations in each

In [13]:
# Finding the total number of unique DL State names
num_dl_states = traffic_violation_df['DL State'].nunique()
print(f'The total number of states in DL State column: {num_dl_states}')

#Displaying the count of each DL state
traffic_violation_df['DL State'].value_counts()
The total number of states in DL State column: 71
Out[13]:
MD    1576139
DC      60868
VA      59940
XX      25827
PA      10924
FL      10145
NY       8118
NC       6112
CA       5922
TX       4699
WV       4116
GA       4005
NJ       3844
MA       2768
OH       2275
DE       1966
IL       1959
SC       1795
WA       1431
MI       1368
AZ       1288
CT       1270
TN       1223
CO       1213
US        842
IN        768
AL        714
MO        698
LA        672
WI        481
MN        475
MS        455
NV        439
NM        426
KY        408
ME        406
OK        387
UT        382
RI        371
OR        338
NH        311
KS        294
VI        293
HI        271
IA        262
ON        240
AK        225
AR        215
ND        193
MT        175
ID        163
NE        155
VT        155
PR        123
MB        102
IT         74
SD         66
AB         44
WY         42
NB         34
QC         33
SK         26
BC         17
GU         16
PE         12
AS          9
NS          7
PQ          6
MH          5
NF          2
YT          1
Name: DL State, dtype: int64

From the above result, we see that in the 'DL State' column, there are 70 different States and it contains several wrong state name abbreviations such as 'AB','BC','IT', 'MB','MH','NB','NF','NS','ON','PE','PQ','QC','SK','US' and 'XX'. [2]
Now we have 2 options:
(a) To replace the incorrect state abbreviations with the valid ones.
(b) To replace the incorrect state abbreviations with 'XX'.

However, we'll proceed with option(b) beacuse we have no basis to predict which incorrect abbreviation corresponds to valid abbreviation. For example: 'AB' is an incorrect abbreviation and if we were to replace it with a valid abbreviation, we wouldn't know which valid abbreviation to choose from : 'AK', 'AL', 'AR', 'AS' or 'AZ'.

Also, there are many valid abbeviations starting with the same letter so we cannot replace the invalid abbreviation with a valid abbreviation.

Thus, we choose to replace the incorrect state abbreviations with 'XX'.

Now that we know how many different state abbreviations have been mis-spelt in DL State column, we can go ahead and replace them with 'XX'

In [14]:
#Replace mis-spelt state names with XX
traffic_violation_df['DL State'].replace(to_replace=['AB','BC','IT', 'MB','MH','NB','NF','NS','ON','PE','PQ','QC','SK','US'], value="XX", inplace = True)

We can count the number of states post replacement to verify if there is a reduction in the number.

In [15]:
# Finding the total number of unique DL State names
valid_num_dl_states = traffic_violation_df['DL State'].nunique()
print(f'The total number of valid states in DL State column: {valid_num_dl_states}')
The total number of valid states in DL State column: 57

Therefore, after cleaning the 'DL State' column, we have 50 valid states from the USA, 6 insular states which are AS GU MP PR VI UM,and Washington DC

Step 5: Let's explore Drivers State which signifies the state of the driver’s home address.

This is similar to the DL state column we cleaned, hence we can also replace the invalid state names with 'XX'

In [16]:
#Replace mis-spelt state names with XX
traffic_violation_df['Driver State'].replace(to_replace=['AB','BC','IT', 'MB','NB','NF','NS','ON','PE','PQ','QC','SK','US'], value="XX", inplace = True)

#Displaying the count of each Driver state
traffic_violation_df['Driver State'].value_counts().sort_index(ascending=True)
Out[16]:
AK        103
AL        500
AR        136
AZ        581
CA       3353
CO        680
CT        896
DC      59933
DE       1628
FL       6320
GA       2486
GU          7
HI        139
IA        108
ID         76
IL       1204
IN        564
KS        165
KY        291
LA        414
MA       1808
MD    1635281
ME        236
MI        864
MN        281
MO        433
MS        322
MT         88
NC       4268
ND        145
NE         94
NH        200
NJ       2842
NM        285
NV        265
NY       5592
OH       1591
OK        270
OR        182
PA       9202
PR         34
RI        232
SC       1122
SD         41
TN        730
TX       2724
UT        224
VA      56366
VI          8
VT        104
WA        867
WI        290
WV       3956
WY         35
XX       1400
Name: Driver State, dtype: int64

From the above generated list, we can observe that the state names have been cleaned and the only invalid state is 'XX', which we shall retain for our analysis

Step 6: Let's explore the Location, Latitutde and Longitude of the traffic violation

Latitude and Longitude are geographical coordinates on the Earth. Latitude is based off the distance from the North and South pole and represents a vertical coordinate while longitude is based on the Prime Meridian as its standard 0. Both Latitude and Longitude are both represented in degrees from their respective markers. In this data set, the latitude and longitude are used to put the coordinates for every traffic violation recorded.

The data for latitude and longitude does have some cleaning to do as some of the latitudes and longitudes fall outside the range of coordinates that are in the state of Maryland. Since we cannot interpret locations in order to impute data, we need to filter out the bad data points by dropping them. In this case, we classify valid coordinates that have latitudes between 37 degrees and 40 degrees. For longitude, we classified valid coordinates as between -82 degrees and -75 degrees. This ranges are based on research done at https://www.mapsofworld.com/usa/states/maryland/lat-long.html which shows data for latitude and longitude of major cities and locations in Maryland.

There are also instances of NaN values for latitude and longitude which means there is missing data. Again, since we cannot assume to inpute any data into these columns, the only solution is to either drop them or leave them. Since the number of NaNs is negligble compared to the rest of the data at about 7.9%, we decided to drop them as well which was all done in one step per column.

In [17]:
#Set the number of rows to be displayed back to default(10)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', 10)

#Converting latitude and longitude to the required format
lattitude_ = traffic_violation_df['Latitude'].astype(float)
longitude_ = traffic_violation_df['Longitude'].astype(float)

#Filtering out the Latitude and Longitude datapoints that do not fall within Marylands range
traffic_violation_df= traffic_violation_df[(traffic_violation_df['Latitude']>37) & (traffic_violation_df['Latitude']<40)]
traffic_violation_df = traffic_violation_df[(traffic_violation_df['Longitude'] > -82) & (traffic_violation_df['Longitude'] < -75)]

Now that we have exlpored the main attributes required for our data analysis, we still feel that it can be improved through involving some more meangingful data

Step 7: Use of dummy variables for categorical variables
Having a dummy variable instead of the text in categorical vaiables, such as 'Accident', 'Personal Injury', 'Property Damage' and 'Alocohol' would make it easier to calulate the total number. Hence, we replace all the values that are 'No' with a 0 and all the values that are 'Yes' with 1.

In [18]:
#Set categorical variables
no = 0
yes = 1

#Assign categorical variables for each column.
traffic_violation_df['Accident']=traffic_violation_df['Accident'].map(lambda x: yes if x=='Yes' else no)
traffic_violation_df['Personal Injury']=traffic_violation_df['Personal Injury'].map(lambda x: yes if x =='Yes' else no)
traffic_violation_df['Property Damage']=traffic_violation_df['Property Damage'].map(lambda x: yes if x =='Yes' else no)
traffic_violation_df['Alcohol']=traffic_violation_df['Alcohol'].map(lambda x: yes if x =='Yes' else no)

traffic_violation_df.head(2)
Out[18]:
SeqID Make Model VehicleType Race Accident Fatal Description Gender Date Of Stop ... Time Of Stop Violation Type DL State Driver State Personal Injury Property Damage Alcohol Latitude Longitude Arrest Type
0 fbc324ab-bc8d-4743-ba23-7f9f370005e1 TOYOTA CAMRY 02 - Automobile BLACK 0 No LEAVING UNATTENDED VEH. W/O STOPPING ENGINE, L... M 08/11/2019 ... 20:02:00 Citation MD MD 0 0 0 38.989743 -77.09777 A - Marked Patrol
1 a6d904ec-d666-4bc3-8984-f37a4b31854d HONDA CIVIC 02 - Automobile WHITE 0 No EXCEEDING POSTED MAXIMUM SPEED LIMIT: 85 MPH I... M 08/12/2019 ... 13:41:00 Citation MD MD 0 0 0 39.174110 -77.24617 A - Marked Patrol

2 rows × 22 columns

ANALYSIS¶

Analysis 1: Which Cities in the state of Maryland contribute to the most number of Accidents?¶

We created a heatmap that depicts the density of accidents in the state of Maryland to assist the police department in determining where they should focus their efforts in reducing accidents.
The number of accidents is indicated by the color code:
Green: Low accident risk region
Yellow: Medium accident risk region
Red: High accident risk region

In [19]:
# Using folium to create a heat map and entering location coordinates of Maryland State
traffic_violation_df_map = folium.Map(location=[39.045753, -76.641273],
                    zoom_start = 10) 

# Need to convert all values in the columns into float
traffic_violation_df['Latitude'] = traffic_violation_df[traffic_violation_df['Accident']==1]['Latitude'].astype(float)
traffic_violation_df['Longitude'] = traffic_violation_df[traffic_violation_df['Accident']==1]['Longitude'].astype(float)

# Filtering the Dataframe for rows, then columns, then remove NaNs
heat_df = traffic_violation_df[traffic_violation_df['Accident']==1][['Latitude', 'Longitude']]
heat_df = heat_df.dropna(axis=0, subset=['Latitude','Longitude'])

# Using a list comprehension to make a list of lists
heat_data = [[row['Latitude'],row['Longitude']] for index, row in heat_df.iterrows()]

# Plotting it on the map
HeatMap(heat_data,min_opacity=0.2).add_to(traffic_violation_df_map)

# Displaying the map
traffic_violation_df_map
Out[19]:
Make this Notebook Trusted to load map: File -> Trust Notebook

An observation from the above visualization is that Glenmont contributes to the maximum number of Accidents.

Analysis 2: Is there any racial bias in Maryland traffic police?¶

One question that came up while investigating the dataset was whether or not racial bias is shown in the number of traffic violations based on race. In order to test this theory, we focused on two columns ; the Arrest Type column and the Race column. First we cleaned the Arrest Type column by eliminating the unncessary letter codes for each type of arrest, then we put each arrest type into 2 groups; human stops and technology stops. We determined that the traffic violations were occuring by either an actual police or person of authority conducting the stops, or a camera/radar or some other form of technology recording the violations.

For the human recorded violations, we determined that they were :

'Marked Patrol' 'Foot Patrol' 'Unmarked Patrol' 'Motorcycle' 'Marked (Off-Duty)' 'Mounted Patrol' 'Unmarked (Off-Duty)'

For the technology recorded violations, we determined that they were:

'Marked Laser' 'Marked Stationary Radar' 'License Plate Recognition' 'Unmarked VASCAR' 'Marked Moving Radar (Moving)' 'Unmarked Moving Radar (Stationary)' 'Marked Moving Radar (Stationary)' 'Unmarked Stationary Radar' 'Marked VASCAR' 'Unmarked Moving Radar (Moving)'

We also determined that 'Aircraft Assistance' was not one way or the other as the assistance implies that it is not the only thing working towards recording the violation.

The reason to divide the violations into human and technology is to use technology as a base for comparison. Technology cannot have racial bias unless it is programmed into it which is highly doubtful as it has no place in traffic violations. So we hypthosesized that the ratio between non-white and white traffic violations from technology stops would be close to 1.

We also hypothesized that non-white races (being Hispanic, Black, Asian, and Native American) would be targeted more during traffic violations and thus yield a high ratio of non-white to white traffic violations from human recordings.

In [20]:
#Check the unique arrest types in our dataset
traffic_violation_df['Arrest Type'].unique()
Out[20]:
array(['A - Marked Patrol', 'L - Motorcycle', 'Q - Marked Laser',
       'I - Marked Moving Radar (Moving)', 'B - Unmarked Patrol',
       'F - Unmarked Stationary Radar', 'R - Unmarked Laser',
       'G - Marked Moving Radar (Stationary)',
       'E - Marked Stationary Radar', 'O - Foot Patrol',
       'H - Unmarked Moving Radar (Stationary)', 'M - Marked (Off-Duty)',
       'J - Unmarked Moving Radar (Moving)', 'N - Unmarked (Off-Duty)',
       'S - License Plate Recognition', 'C - Marked VASCAR',
       'P - Mounted Patrol', 'D - Unmarked VASCAR', 'K - Aircraft Assist'],
      dtype=object)
In [21]:
#Clean the data for letter codes and whitespace 
traffic_violation_df['Arrest Type']= traffic_violation_df['Arrest Type'].str.replace('(\D{1}\s\-\s)','')
C:\Users\Admin\AppData\Local\Temp\ipykernel_21932\634631987.py:2: FutureWarning: The default value of regex will change from True to False in a future version.
  traffic_violation_df['Arrest Type']= traffic_violation_df['Arrest Type'].str.replace('(\D{1}\s\-\s)','')
In [22]:
#Classify arrest types as human or tech 
human_check = ['Marked Patrol','Foot Patrol','Unmarked Patrol','Motorcycle','Marked (Off-Duty)','Mounted Patrol','Unmarked (Off-Duty)']
tech_check = ['Marked Laser','Marked Stationary Radar','Unmarked Laser','License Plate Recognition','Unmarked VASCAR','Marked Moving Radar (Moving)','Unmarked Stationary Radar','Marked VASCAR','Unmarked Moving Radar (Moving)']
In [23]:
#Create different dataframes for those checked by humans and tech
human_check_df = traffic_violation_df[traffic_violation_df['Arrest Type'].isin(human_check)]
tech_check_df = traffic_violation_df[traffic_violation_df['Arrest Type'].isin(tech_check)]
In [24]:
#Display the unique races
traffic_violation_df.Race.unique()
Out[24]:
array(['BLACK', 'WHITE', 'HISPANIC', 'OTHER', 'ASIAN', 'NATIVE AMERICAN'],
      dtype=object)
In [25]:
#Classify the non-white races and calculate the numbers of pullovers  
non_white_races = ['BLACK','HISPANIC','ASIAN','NATIVE AMERICAN']
print(f'Non white pullovers by tech {tech_check_df.Race.isin(non_white_races).sum()}')
print(f'Non white pullovers by human {human_check_df.Race.isin(non_white_races).sum()}')
print(f'White pullovers by tech {(tech_check_df.Race=="WHITE").sum()}')
print(f'White pullovers by human {(human_check_df.Race=="WHITE").sum()}')
Non white pullovers by tech 99640
Non white pullovers by human 901066
White pullovers by tech 89078
White pullovers by human 480048
In [26]:
#Calclate the ratio of pullovers for different races
non_white_tech = tech_check_df.Race.isin(non_white_races).sum()
non_white_human = human_check_df.Race.isin(non_white_races).sum()
white_tech = (tech_check_df.Race=="WHITE").sum()
white_human = (human_check_df.Race=="WHITE").sum()
tech_others_white_ratio = (non_white_tech/white_tech) 
human_others_white_ratio = (non_white_human/white_human)
In [27]:
#Display the ratio results
print(f'The ratio of non-white traffic violations to whites recorded by humans is {human_others_white_ratio:.2f} which is nearly 2:1.')
print(f'The ratio of non-white Traffic violations to whites recored by technology is {tech_others_white_ratio:.2f} which is nearly 1, making it even.')
The ratio of non-white traffic violations to whites recorded by humans is 1.88 which is nearly 2:1.
The ratio of non-white Traffic violations to whites recored by technology is 1.12 which is nearly 1, making it even.
In [28]:
#Display the results in a tabular form
racial_bias_df = pd.DataFrame({'Arrest Type':['Tech','Tech', 'Human','Human'], 'Race':['White','Non-White','White','Non-White'],'Number of Violations': [white_tech,non_white_tech,white_human,non_white_human]}, index = [1,2,3,4])
racial_bias_df
Out[28]:
Arrest Type Race Number of Violations
1 Tech White 89078
2 Tech Non-White 99640
3 Human White 480048
4 Human Non-White 901066
In [29]:
#Plot a graph of pullovers by tech and human for people of different races
import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Bar(name='White', x=racial_bias_df[racial_bias_df['Race'] == 'White']['Arrest Type'], y=racial_bias_df[racial_bias_df['Race'] == 'White']['Number of Violations']),
    go.Bar(name='Non-white', x=racial_bias_df[racial_bias_df['Race'] == 'Non-White']['Arrest Type'], y=racial_bias_df[racial_bias_df['Race'] == 'Non-White']['Number of Violations']),
])
fig.update_layout(
    title= "Human VS Tech Racial Profiling",
    xaxis_title="Arrest Type",
    yaxis_title="Count",
    legend_title="Legends",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="RebeccaPurple"
    ))
fig.show()

Our results have found that the tech recording yielded a ratio of 0.98 white to non-white traffic violations which is on par with our hypothesis. The human recording showed a ratio of 1.7 which means it is almost 2:1 the ratio between non-white traffic violations and white traffic violations recorded by humans. This graph shows the difference in raw sums and as you can see, the tech arrest type yieleded nearly even results while the human recording was overwhelmingly less for whites compared to non-whites. Thus, we can conclude that there is racial bias in traffic violations in Maryland based on this dataset.

We can infer that police officers in Maryland feel the need to stop non-white drivers rather than white drivers due to personal bias. We can't say that every traffic police officer in the area is 'racist' but we can infer clear bias from the group as a whole.

Analysis 3: Does a reckless driver get a warning or a citation and does the warning rate differ between males and females?¶

Now let's assess whether a reckless driver gets a warning or a citation? We will also assess if there is a difference in warning rates between males and females in the dataset.

A reckless driver can be distinguished by atleast one of the following events:

  1. The driver was involved in an accident.
  2. The driver contributed to an accident.
  3. The driver damaged property with his/her vehicle.
  4. The driver was under the influence of alcohol.
In [30]:
# Calculate the count of reckless drivers in the dataset
reckless_drivers_df = traffic_violation_df.loc[(traffic_violation_df['Accident'] == 1)  | (traffic_violation_df['Personal Injury'] == 1) | (traffic_violation_df['Property Damage'] == 1) | (traffic_violation_df['Alcohol'] == 1)]
count_reckless_drivers = len(reckless_drivers_df)
# Calculate the count of reckless drivers that received warning
warning_rd_df = reckless_drivers_df.where(reckless_drivers_df['Violation Type'] == 'Warning').dropna()
reckless_drivers_warning = len(warning_rd_df)

# FEMALE
# Count of female reckless drivers
female_reckless_drivers = len(reckless_drivers_df.where(reckless_drivers_df['Gender'] == 'F').dropna())
# Count of female reckless drivers that received warnings
female_warnings = len(warning_rd_df.where(warning_rd_df['Gender'] == 'F').dropna())
# Count of female reckless drivers that received citations
female_citations = female_reckless_drivers - female_warnings

# Count of male reckless drivers
male_reckless_drivers = len(reckless_drivers_df.where(reckless_drivers_df['Gender'] == 'M').dropna())
# Count of male reckless drivers that received warnings
male_warnings = len(warning_rd_df.where(warning_rd_df['Gender'] == 'M').dropna())
# Count of male reckless drivers that received citations
male_citations = male_reckless_drivers - male_warnings

print(f'Out of {len(traffic_violation_df)} violations, there are {count_reckless_drivers} reckless drivers and out of these, {reckless_drivers_warning} received a warning instead of a citation')
print(f'If we look further into it, we see that the female warning rate is {(female_warnings/female_reckless_drivers)*100:.2f}% whereas male warning rate is {(male_warnings/male_reckless_drivers)*100:.2f}%')
Out of 1685941 violations, there are 73306 reckless drivers and out of these, 3179 received a warning instead of a citation
If we look further into it, we see that the female warning rate is 9.53% whereas male warning rate is 6.40%
In [31]:
# Let's visualize the result
from matplotlib import rcParams
df = pd.DataFrame(dict(
   x=['Male', 'Female'],
   y1=[male_citations, female_citations],
   y2=[male_warnings, female_warnings]
))
bar_plot1 = sns.barplot(x='x', y='y1', data=df, label="Citations", color="c")
bar_plot2 = sns.barplot(x='x', y='y2', data=df, label="Warnings", color="m")
bar_plot1.set(xlabel='GENDER', ylabel='COUNT')
plt.title("Reckless Drivers: Warning vs Citation")
sns.set(style="darkgrid")
rcParams['figure.figsize'] = 5,5
plt.legend()
plt.show()

Analysis 4: Who are the worst drivers based on race and gender?¶

Analyzing if there is any correlation between race and gender of a person with accidents. Also understanding if there are any reason for the correlation

In [32]:
#Create a new dataframe where drivers are male and involved in accidents
df_new_male_accident = traffic_violation_df[(traffic_violation_df['Gender'] == 'M')&(traffic_violation_df['Accident']==1)]
#Create a new dataframe where driers are females and involved in accidents
df_new_female_accident = traffic_violation_df[(traffic_violation_df['Gender'] == 'F')&(traffic_violation_df['Accident']==1)]

Now we are taking the max race and max people in the gender which is white race and male gender respectively, so we are creating a new dataframe which has male people who have made accidents and now we will want to find out how many of those male people are white in comparison to other races.

In [33]:
#Counting males who are white and converting to percentage
no_of_rows_male_white=len(df_new_male_accident.index)

count_male_white=(df_new_male_accident['Race']=='WHITE').sum()
male_white_percentage=(count_male_white/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of white race to cause an accident is {count_male_white} which is {male_white_percentage:.2f}%')

count_male_black=(df_new_male_accident['Race']=='BLACK').sum()
male_black_percentage=(count_male_black/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of black race to cause an accident is {count_male_black} which is {male_black_percentage:.2f}%')

count_male_hispanic=(df_new_male_accident['Race']=='HISPANIC').sum()
male_hispanic_percentage=(count_male_hispanic/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of hispanic race to cause an accident is {count_male_hispanic} which is {male_hispanic_percentage:.2f}%')

count_male_other=(df_new_male_accident['Race']=='OTHER').sum()
male_other_percentage=(count_male_other/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of other race to cause an accident is {count_male_other} which is {male_other_percentage:.2f}%')

count_male_asian=(df_new_male_accident['Race']=='ASIAN').sum()
male_asian_percentage=(count_male_asian/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of asian race to cause an accident is {count_male_asian} which is {male_asian_percentage:.2f}%')

count_male_native_american=(df_new_male_accident['Race']=='NATIVE AMERICAN').sum()
male_native_american_percentage=(count_male_native_american/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of native american race to cause an accident is {count_male_native_american} which is {male_native_american_percentage:.2f}%')
Frequency of male gender who is of white race to cause an accident is 9421 which is 31.74%
Frequency of male gender who is of black race to cause an accident is 7556 which is 25.46%
Frequency of male gender who is of hispanic race to cause an accident is 9799 which is 33.02%
Frequency of male gender who is of other race to cause an accident is 1500 which is 5.05%
Frequency of male gender who is of asian race to cause an accident is 1333 which is 4.49%
Frequency of male gender who is of native american race to cause an accident is 70 which is 0.24%
In [34]:
#Counting males who are white and converting to percentage
no_of_rows_female_white=len(df_new_female_accident.index)

count_female_white=(df_new_female_accident['Race']=='WHITE').sum()
female_white_percentage=(count_female_white/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of white race to cause an accident is {count_female_white} which is {female_white_percentage:.2f}%')

count_female_black=(df_new_female_accident['Race']=='BLACK').sum()
female_black_percentage=(count_female_black/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of black race to cause an accident is {count_female_black} which is {female_black_percentage:.2f}%')

count_female_hispanic=(df_new_female_accident['Race']=='HISPANIC').sum()
female_hispanic_percentage=(count_female_hispanic/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of hispanic race to cause an accident is {count_female_hispanic} which is {female_hispanic_percentage:.2f}%')

count_female_other=(df_new_female_accident['Race']=='OTHER').sum()
female_other_percentage=(count_female_other/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of other race to cause an accident is {count_female_other} which is {female_other_percentage:.2f}%')

count_female_asian=(df_new_female_accident['Race']=='ASIAN').sum()
female_asian_percentage=(count_female_asian/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of asian race to cause an accident is {count_female_asian} which is {female_asian_percentage:.2f}%')

count_female_native_american=(df_new_female_accident['Race']=='NATIVE AMERICAN').sum()
female_native_american_percentage=(count_female_native_american/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of native american race to cause an accident is {count_female_native_american} which is {female_native_american_percentage:.2f}%')
Frequency of female gender who is of white race to cause an accident is 5549 which is 41.27%
Frequency of female gender who is of black race to cause an accident is 3484 which is 25.91%
Frequency of female gender who is of hispanic race to cause an accident is 2546 which is 18.93%
Frequency of female gender who is of other race to cause an accident is 809 which is 6.02%
Frequency of female gender who is of asian race to cause an accident is 1041 which is 7.74%
Frequency of female gender who is of native american race to cause an accident is 18 which is 0.13%
In [35]:
#Initialize a list with all unique races
data = traffic_violation_df['Race'].unique()
  
#Create a DataFrame with count of accidents by race and gender
new_df = pd.DataFrame(data, columns=['Race'])
count_accidents_made_by_males = [count_male_black, count_male_white, count_male_hispanic, count_male_other, count_male_asian, count_male_native_american]
count_accidents_made_by_females = [count_female_black , count_male_white,  count_female_hispanic, count_female_other, count_female_asian, count_female_native_american]
new_df ['Count_Male_Accidents'] = count_accidents_made_by_males
new_df ['Count_Female_Accidents'] = count_accidents_made_by_females
# Create a pivot table to display the data
new_df_pivot = pd.pivot_table(new_df, values = ['Count_Male_Accidents', 'Count_Female_Accidents'], columns = 'Race')
new_df_pivot
Out[35]:
Race ASIAN BLACK HISPANIC NATIVE AMERICAN OTHER WHITE
Count_Female_Accidents 1041 3484 2546 18 809 9421
Count_Male_Accidents 1333 7556 9799 70 1500 9421
In [36]:
#Create a line chart to visualize this data [3]
import plotly.offline as pyo
layout = go.Layout(title = 'Male count vs Female count making accidents')

traces =[go.Scatter(
    x = new_df_pivot.columns,
    y= new_df_pivot.loc[rowname],
    mode = 'markers+lines',
    name = rowname
)for rowname in new_df_pivot.index]

figure = go.Figure (data = traces, layout=layout)
figure.update_layout(
    xaxis_title="Race",
    yaxis_title="Count",
    legend_title="Legends",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="RebeccaPurple"
    ))

figure.show()

The observations from the graph are:

(i) Hispanic Males followed closely by White Males and Females cause the most number of accidents.

(ii) Native American Males and Females cause the least number of accidents.

Analysis 5: Now that we have understood which all factors contribute to an accident, Let's predict if there will be an accident based on these parameters.¶

We will use a machine learning algortithm for this prediction¶

Machine learning uses a data set to understand and then builds a model to leverage that data to to improve performance and predict data points. Machine Learning models can be categorized as supervised or unsupervised.

Supervised leaning is an approach that uses labelled datasets which are then classified to predict the outcomes accurately. The model measures the accuracy using the input-output pair. It is further divided into two types:

● Regression – This method understands the relationship between the dependent and independent variables and predict numerical values based on data points. The output is continous

Below are some commonly used regression models:

  1. Linear Regression – Here, the ‘line of best fit’ represents a dataset and is found by minimizing the squared error (squared distance between the line of best fit and points) This helps us in predicting the future points and identifying outliers.

  2. Decision Tree – Here, the dataset is continuously spilt based on a certain parameter. Each node is further divided into multiple nodes. The last node is where the decision is made is the leaves of the tree. The more nodes you have, the better accuracy you get.

  3. Random Forest – Here, it involves multiple decision trees based on different samples and takes the majority vote in case of classification and averaging in case of regression

  4. Neural Network – Here, it takes one or more inputs and goes through a network of equations, resulting in one or more outputs

● Classification – The output is discrete. Commonly used models are Logistic Regression, Super Vector Machine, Naive Bayes, Decision Tree, Random Forest, and Neural Network

Unsupervised Learning uses algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns in data without the need for human intervention. [4]

Now that we have understood how each model works and are different from one another, we land to a decision that for our predictions, we will use "Random Forest" model. Random forest can handle both continuous and categorical variables and since we have to predict a categorical variable, we use Random forest as it performs better and gives better results.

To predict the accident, we are using the following parameters: 'Race', 'Gender', 'State', 'Year', 'Make', 'DL State' The data set does contain more attributes, however they do not contribute to an accident or are the after effects.

We are storing the parameters responsible for an accident in a variable X which are the independent data points and we are using Y as the dependent variable which is "Accident"

In [37]:
#Define the dependent and independent variables
X = traffic_violation_df[['Race', 'Gender', 'State', 'Year', 'Make', 'DL State']]
y= traffic_violation_df['Accident']

Since Machine Learning models cannot handle text, we have to map all the values in our attributes to some integers.

Gender - We map all Males to 2, all Females to 1 and all Unknowns become 0

Race - We map Black to 0, White to 1, Hispanic to 2, Asian to 3, Native American to 4 and all others become 5

DL state - We map all Maryland drivers to 1 and all the rest of the states' drivers become NaN by default. We will handle this in the further session.

In [38]:
#Map the genders to numeric values
gender_categories = {"M": 2, "F": 1, "U":0}
X['Gender']= X['Gender'].map(gender_categories)

#Map the races to numeric values
race_categories = {"BLACK": 0, 'WHITE':1, "HISPANIC": 2, "ASIAN":3, 'NATIVE AMERICAN':4 , 'OTHER':5}
X['Race']= X['Race'].map(race_categories)

#Map the states to numeric values
state_categories = {"MD": 1}
X['State']= X['State'].map(state_categories)
X['DL State']= X['DL State'].map(state_categories)

Now let's looking into a given set of vehichles and understand which all car models hold the most accidents

In [39]:
X['Make'].value_counts().head(15)
Out[39]:
TOYOTA        296795
HONDA         248882
FORD          154847
NISSAN        128866
CHEVROLET     124063
               ...  
VOLKSWAGEN     39107
LEXUS          38787
JEEP           38665
MAZDA          32452
SUBARU         24899
Name: Make, Length: 15, dtype: int64

From the above result, we see that majority of our data set hold the top 15 car brands. Hence, for our analysis, we only use the top 15 car brands to be precise and accurate.

In [40]:
#Map the vehicle make to numeric values
make_categories = {"TOYOTA": 1, 'HONDA':2, 'FORD': 3, 'NISSAN':4, 'CHEVROLET': 5, 'HYUNDAI':6,'DODGE':7, 'ACURA':8, 'MERCEDES':9, 'BMW':10, 'VOLKSWAGEN':11, 'JEEP':12, 'LEXUS':13, 'MAZDA':14, 'SUBARU':15}
X['Make']= X['Make'].map(make_categories)

Since, Machine Learning models cannot handle NaN values. We will convert all NaNs to 0 in Make, State and DL state. However, we cannot proceed with the same for "Years" since a 0 value will confuse the models as the year ranges between 1900-2020. Hence, we fill the NaN years with the mean of the years

In [41]:
#Fill in the missing values
X['Make'] = X['Make'].fillna(0)
X['State']= X['State'].fillna(0)
X['DL State']= X['DL State'].fillna(0)
X['Year']= X['Year'].fillna(X['Year'].mean())

Let's proceed with the next step and split the dataset into test and train data for ML. Training model holds the 70% of the data and testing model holds the rest 30%

In [42]:
#Split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

Before we initialize the algorithm, we need to provide the number of decision trees we need in our alogrithm. In our case, regardless of using 10 or 100, the accuracy seems to be the same. And as we increase the number of decision trees, it takes longer to process and thus, the systems slows down. So, to avoid that, we will go with 10 as our n_estimators

In [43]:
#Initialize the ML algorithm
forest = RandomForestClassifier(n_estimators=10)

As our next step, we are using the 70% of our data to train the model so that it can then build an algorithm to predict the rest

In [44]:
#Fit the training data into the ML Model
forest.fit(X_train, y_train)
Out[44]:
RandomForestClassifier(n_estimators=10)

Since the remaining 30% of our data is for testing the model, we will use that in "Random Forest" to predict those 30% of the data points

In [45]:
#Predict the output
y_pred = forest.predict(X_test)
y_pred
Out[45]:
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

Now that we have our 30% of the data predicted through "Random Forest" model, we wil compare those with the actual data points and determine the accuracy of the model

In [46]:
#Calculate the accuracy of our model
print(f'Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}% ')
Accuracy: 97.44% 

The accuracy is nothing but the ratio of number of predictions that were correctly predicted to the total number of all the predictions, which in our case is 97.44%. This implies that 97.44% of all the predictions made by our model were correct.

In [47]:
#Generate a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.title('Confusion Matrix')
plot = sns.heatmap(conf_matrix, annot=True, fmt="d")
plot.set(xlabel='Predicted Values', ylabel='Actual Values')
plt.show()

From the confusion matrix, we can see that we have 492663 violations that were predicted to not result in an accident and in reality it did not and, 177 violations that were predicted to result in an accident and they did. However, there were also 12773 values of accident that were predicted to not to be an accident but resulted in accident and 170 violations were predicted to be an accident while it did not result in one. These errors result from the fact that only 4% of the violations in our violation dataset actually result in an accident. However, we do have a good accuracy and believe that as we have more data of accidents in our dataset, it would enable the model to learn more and produce a better result for all cases in the future.

Now that we have our model ready, we can input a given set of data in order to identify if that particular driver, vehicle combination is likely to cause an accident or not. To demonstarte this, we have used the demographics of a Male, Hispanic driver from Maryland, holding a Maryland driving license who is driving a 2006 make Toyota.

In [48]:
#Find the output of our model based on a given set of inputs
race = 'HISPANIC'
gender = 'M'
state = 'MD'
year = '2006'
make = 'TOYOTA'
dl_state = 'MD'
test_data = []

race_map = {"BLACK": 0, 'WHITE':1, "HISPANIC": 2, "ASIAN":3, 'NATIVE AMERICAN':4 , 'OTHER':5}
if race in race_map.keys():
    test_data.append(race_map[race])

gender_map = {"M": 2, "F": 1, "U":0}
if gender in gender_map.keys():
    test_data.append(gender_map[gender])


state_map = {"MD": 1}
if state in state_map.keys():
    test_data.append(state_map[state])
else:
    test_data.append(0)

test_data.append(year)
    
if dl_state in state_map.keys():
    test_data.append(state_map[dl_state])
else:
    test_data.append(0)

model_map = {"TOYOTA": 1, 'HONDA':2, 'FORD': 3, 'NISSAN':4, 'CHEVROLET': 5, 'HYUNDAI':6,'DODGE':7, 'ACURA':8, 'MERCEDES':9, 'BMW':10, 'VOLKSWAGEN':11, 'JEEP':12, 'LEXUS':13, 'MAZDA':14, 'SUBARU':15}
if make in model_map.keys():
    test_data.append(model_map[make])
else:
    test_data.append(0)
    
y_pred = forest.predict([test_data])

gender_map = {"M": "Male", "F": "Female", "U":"Unknown"}
if gender in gender_map.keys():
    gender = gender_map[gender]
    
accident_map = {0: " not", 1: ""}
decision = accident_map[y_pred[0]]

print(f'When a {gender}, {race} driver from {state} with their driving licence state from {dl_state} is driving a {year} model {make}, the driver is{decision} likely to cause an accident')
When a Male, HISPANIC driver from MD with their driving licence state from MD is driving a 2006 model TOYOTA, the driver is not likely to cause an accident
C:\Users\Admin\anaconda3\lib\site-packages\sklearn\base.py:450: UserWarning:

X does not have valid feature names, but RandomForestClassifier was fitted with feature names

Conclusions¶

Through our analysis, the authorities will find many things helpful. 1) We were able to see which location had the most accidents being made and hence the police can be more vigilent and increase the traffic rules in those areas.

2) Racial bias should be decreased and strict measures should be taken on the police officer as well if any such false accusations are made just on the basis of race. This will help the community to also be free and safe.

3) Female warning rate is higher than males which shows a difference in treatment when police assess whether or not to give a warning. Thus theres a possibility that females take traffic violations leniently and department should look into it in order to enforce stricter traffic laws on everyone equally. One thing they can do about it is training officers not to give preferential treatment through decentives such as docks in pay when preferential instances are shown.

4) Male Hispanics are causing the most number of accidents in comparison to any other race/gender and hence these matters should be addressed and department should look more into it.

5) The predictions will also help that okay what type of make, model, race and gender will cause an accident in the future or not. This can help us follow a trend that if some particular make or model of car is causing most accidents so then that car has some issue.

6) This analysis will help the community be safe and hopefully help reduce accidents.

References¶

[1] https://www.census.gov/quickfacts/MD

[2] https://www.faa.gov/air_traffic/publications/atpubs/cnt_html/appendix_a.html

[3] https://towardsdatascience.com/the-next-level-of-data-visualization-in-python-dd6e99039d5e

[4] https://towardsdatascience.com/all-machine-learning-models-explained-in-6-minutes-9fe30ff6776a